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We propose an inferenee method to estimate sparse interaetions and biases aceording to Boltz¬ 
mann machine learning. The basis of this method is Li regularization, which is often used in 
compressed sensing, a technique for reconstructing sparse input signals from undersampled 
outputs. Li regularization impedes the simple application of the gradient method, which op¬ 
timizes the cost function that leads to accurate estimations, owing to the cost function’s lack 
of smoothness. In this study, we utilize the majorizer minimization method, which is a well- 
known technique implemented in optimization problems, to avoid the non-smoothness of the 
cost function. By using the majorizer minimization method, we elucidate essentially relevant 
biases and interactions from given data with seemingly strongly-correlated components. 


1. Introduction 

Because massive amounts of structured and unstructured data continue to accumulate, 
the importance of effective big data analysis is rapidly increasing. One well-known big data 
analysis tool is Boltzmann machine learning. This technique is physics-friendly, because it is 
a form of probability density defined by the Hamiltonian of the Ising model.' We assume that 
the generative model has a bias on each variable, the magnetic field, and the pair-wise interac¬ 
tions between the different variables (i.e., the interaction between adjacent spins). Boltzmann 
machine learning has proven effective, and has stimulated increasing interest in deep leam- 
ing.^^^ Deep learning typically needs large volumes of data for its implementation. Currently, 
this demand is often satisfied because we are in the so-called big data era; however, we require 
hard computation as a return. Thus, the study of Boltzmann machine learning may involve 
constructing a good approximation.^ Otherwise, we require a novel method to achieve ef¬ 
ficient learning, even from a small amount of data. 


*mohzeki@i.kyoto-u. ac.jp 




J. Phys. Soc. Jpn. 


Effective big data analysis can produce a substantial amount of valuable information. An 
objective of this analysis is to elucidate a small number of relevant quantities to describe the 
acquired data, a process known as variable selection. The goal of data-driven science is to 
capture an essential portion of the generative model and to identify the characteristics that 
describe its origin. In order to achieve this goal, sparseness may be imposed on the bias or 
pair-wise interactions of the generative model. One successful approach is to employ the 
regularization of the L\ norm of the bias and pair-wise interactions. However, because of 
the Li norm’s lack of differentiability, the application of the simple gradient method is not 
straightforward. A different method employs a greedy algorithm, which seeks a small number 
of non-zero components satisfying some criteria. Under some conditions, greedy algorithms 
can overcome the L\ regularization.^^’ However, greedy algorithms depend on the properties 
of the parameters to be estimated; moreover, L\ regularization cannot be discarded, because 
it has a wide range of applications and enables us to perform robust inference for various 
models. 

In this study, we resolve the lack of smoothness by implementing a technique for Li 
regularization (often used in optimization studies), namely majorizer minimization.^^’The 
technique reduces a ”many-body” interaction problem to a ”one-body” problem by introduc¬ 
ing the majorizer of the original optimization problem with L\ regularization. This is a type 
of mean-field analysis used in statistical mechanics. We must emphasize that this method 
does not change the optimal solution, and thus yields the exact optimal point under several 
optimized cost function conditions. 

The remaining sections of the paper are organized as follows. In the second section, we 
briefly review Boltzmann machine learning and the recent developments in this area. In the 
third section, we introduce majorizer minimization, and obtain the algorithm to resolve the 
Boltzmann machine learning optimization problem, using Li regularization. In the fourth 
section, we test our method with numerical experiments. In the last section, we summarize 
our study. 

2. Boltzmann machine learning 

We assume that the generative model of the data x e {-1,1}^ takes the form of the Ising 


model as 



( 1 ) 
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where 7,y is a pair-wise interaetion, /?, is a bias, and Z(J, h) is the partition funetion. The 
sets of Jij and hi are denoted as J and h. The number of eomponents is represented by N. 
The summation j s di is ealeulated by summing the adjacent components to one denoted 
by i. Boltzmann machine learning is used to estimate Jij and /?, from snapshots of spin con¬ 
figurations, namely the given data, x® for k = 1,2, • • • ,D by use of the Gibbs-Boltzmann 
distribution of the Ising model as in Eq. (1). The standard method to estimate the parameters 
J and h is the maximum-likelihood estimation'^ as 

f D \ 


{J, hi = arg max 

J,h 


2logE(x®|y,h) 

k=\ 


( 2 ) 


In other words, we minimize the KL divergence between the generative model’s distribution 
and the empirical distribution of the given data defined as 

1 ^ 

/’2)(x) = -^d(x-x®). (3) 

^ k=i 


The minimization of KL divergence 

min KL(E 2 )(x)|E(x|y, h)) = mm P^{x) log (4) 

yields the maximum-likelihood estimation. However, the computational time is excessive, be¬ 
cause the method demands evaluation of the partition function depending on J and h. There¬ 
fore, we require an effective technique to either approximate the partition function or avoid 
the computation of the partition function. 

In the present study, we selected the latter technique. One of the simplest methods to 
mitigate the computation of the log-likelihood function in Boltzmann machine learning is the 
pseudo-likelihood estimation.'^’We change the cost function in the maximum-likelihood 
estimation, which has no terms in common with the partition function, as 

D D N 

Y, log /’(x® 17, h)^Y log n 

k=l k=\ 1=1 


where 


P(xi\J,h,x/i) = 


Z,{J, 


—- exp i V JijXiXj + hiXi 

••I*/'* \U 


( 6 ) 


and 


Z,(7, h|x/;) = ^ exp Y = 2 cosh Y Jij^j + hi ’ (2) 

Xi V jedi / V jedi / 

In the following, we deal with the minimization problem and take the negative of the approx- 
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imated quantity as the cost function, that is 


D 


N 


XpLCy, h) = - log P{x^J, h, xjf). (8) 

k=\ i=\ 

This appears to be a type of mean-field analysis, but the pseudo-likelihood estimation asymp¬ 
totically (large amount of training data) coincides with the maximum-likelihood estimation. 
This method is very simple and easy to implement, but requires an excessive amount of data. 

Another technique for changing the cost function is the minimum probability flow.^° This 
method was inspired by relaxation dynamics, starting from the empirical distribution de¬ 
termined by the given data toward the distribution, using tentative parameters. Relaxation 
dynamics are implemented by a master equation as 

dPt(x) 


dt 


^ lT(x|y)P,(y), 


( 9 ) 


where Vk(x|y) is the transition rate matrix. We impose a one-spin flip at each update and 
detailed balance condition as 


(1 1 ^ 

W(x<'^|x<^^) = exp I -- (^(x^'V, h) - £(x<^V, h)) [ for ^ =N-2, 


( 10 ) 


1=1 


where 


N 


N 


£(x|y, h) = - ^ ^ JijXiXj - ^ hiXi. (11) 

i=l jedi i=l 

The choice of the transition matrix is very important in the following manipulation of the min¬ 
imum probability flow. The maximum likelihood estimation is computationally intractable 
due to the computation of the partition function. We remove the dependence on the partition 
function by choosing the local update rule in the transition matrix as in Eq. (10). For instance, 
the Metropolis method and heat-bath method can be applied to the minimum probability flow. 
In the present study, we follow the original formulation of the minimum probability flow in 
the literature^° for its symmetric form in computation as shown below. If we tune the param¬ 
eters adequately for the empirical distribution of the given data, the change from the initial 
distribution, namely the empirical distribution /’©(x), is expected to be small; otherwise, it be¬ 
comes large. To capture this expectation, we then compute the following infinitesimal change 
of the KL divergence as 


KL(Eo(x)|Edx)) « KL(Eo(x)|Eo(x)) + dt- KL(Eo(x)|E,(x))|,=o • 


( 12 ) 


The combination of elementary algebra and the master equation leads up to the first order of 
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dt as 

KL(Po(x)|P,(x)) « ^ ^ ^ exp (£(x('V, h) - £(x«|7, h))|. (13) 

k=\ l<t:D\ledk ' 

The true parameters are then estimated by minimization of this quantity. This is the mini¬ 
mum probability flow method. Notiee that we do not require to manipulate the Markov ehain 
Monte Carlo (MCMC) simulation, although the method is inspired by stoehastie dynamies. 
This is different from eontrastive divergenee, whieh requires eomputation by MCMC.^^ Onee 
we impose the stoehastie dynamies rule and the detailed balaneed eondition, we immediately 
eompute the above quantity. Thus, we utilize Eq. (13) as the eost funetion to be minimized 
for estimating the parameters, instead of the log-likelihood funetion as in the maximum like¬ 
lihood estimation; that is 


Xmpf(^, h) = ^ y y exp (i (E(x®|7, h) - E(x®|7, h))|, (14) 

^ k=\ HD ^ > 

where the summation over / results in the ease satisfying = N - 2. The perfor- 

manee, estimation preeision, and eomputational effioieney often exeeed those of the pseudo¬ 
likelihood estimation for the same amount of data. In the present study, we employ these 
methods to estimate the parameters; the following diseussion ean be straightforwardly ap¬ 
plied to them. 

Above all, we assume that parameters J and h are assigned to all pairs and all eompo- 
nents. However, in order to elueidate the most relevant pair-wise interaetions and biases from 
the given data, we employ an additional teehnique to prune less signifieant parameters. A 
eandidate is required to utilize the regularization of the Lj norm.^’ Let us then minimize the 
eost funetion X (=Xmpf or Xpl) with Li norm as 

( N \ 


min I Ty E \Jij\ + d/, ^ \hi\ + J1{J, h) > . (15) 

’ I iij) <■=! j 

The regularization teehnique was originally designed to obtain a unique estimation from un¬ 
derdetermined equations by imposing additional eonditions. Therefore, estimations that uti¬ 
lize regularization lead to stable solutions, even from small amounts of data. As eompen- 
sation, the entire eost funetion is not smooth, owing to the existenee of the absolute value 
funetion. The non-smoothness impedes the simple applieation of the gradient method, whieh 
identifies the minimal point of the eost funetion. For the absolute value funetion, we may 
prepare several types of imitating funetions. However, this type of approximation does oe- 
easionally generate ineorreet estimations, and reduees the eonvergenee rate. Instead of the 
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original optimization problem with a non-smooth term, let us utilize a different funetion shar¬ 
ing the same optimal point below, that is the majorizer minimization. 


3. Majorizer minimization 

We briefly review majorizer minimization for eonvenienee. In general, we consider the 
optimization problem by minimizing a convex function / with A^-dimensional variables, 
which is assumed to be differentiable; its derivative V/(x) is Lipschitz. When the derivative 
is Lipschitz, there is a constant L > 0 for any a and b 


N 


k=l 


y\— 


dxk 


\ 2 N 

<LY,{a,-buf 

k=\ 


(16) 


where L is termed as the Lipschitz constant and Uk and bk are the k\h component of N- 
dimensional vectors a and b. The majorizer of the function / is then given by the following 
quadratic function 


N 


L ^ 

(Xk - n) + - ^ (Xk - Vkf ■ 

*='' ^ k=\ 


^(x,v)=/(v)+ y — 

The majorizer always satisfies 

/(x) < g(x, v) < /(v). 

Let us then consider the following optimization problem. 

x'+^ = argmin{g(x,x')}. 

X 

The sequence of the optimal solutions [x°,x', • ■ • ,x^] satisfies 

/(x^"')<g(x'"',xO</(xO 


(17) 


(18) 


(19) 


( 20 ) 


for t = 0,1, • • • ,r - 1. This property of the majorizer gradually approaches the optimal 
solution of the original minimization problem. This technique is referred to as the majorizer 
minimization approach, which is one of the gradient methods. The convergence rate is known 
as /(xO - /(x*) = 6>(l/t), where the asterisk stands for the optimal solution. When we utilize 
the regularization obtained with the Li norm, we solve the following optimization problem 


x'"^* = argmin{g(x,xO + dllxlli}, 

X 


( 21 ) 


where ||x||i = Yuk=\ \^kV Because the majorizer is quadratic and the Li norm is separable, the 
optimal solution can be analytically obtained as 


j+\ 


1 df 


( 22 ) 
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where 


ria{x) = sign(A:)(|x| - a). 


(23) 


Therefore, solving alternative optimization problems is redueed to a simple substitution using 
the tentative solution x^ The majorizer minimization method is broadly used in eompressed 
sensing methods, whieh reeonstruet original inputs from undersampled outputs. In this prob¬ 
lem, the original inputs should be sparse. Li-regularization enforees a sparse solution for the 
inference problem of the original signals. Similarly, let us utilize the majorizer minimiza¬ 
tion method for estimation of the Boltzmann machine learning parameters. Let us remark the 
role of the majorizer in short. The majorizer modifies the original optimization problem into 
quadratic form. The quadratic form separates the dependence on each component. In other 
words, the many-body interaction system with the original function / is changed into a one- 
body independent system consisting of the majorizer. This is a type of mean-field analysis, 
which approximates the many-body interactions into an effective one-body description. In 
statistical mechanics, the law of large numbers is imposed on the number of components N 
to perform mean-field analysis and validation. However, in this method, we do not require a 
large number of components; we only require the property of function /. In this sense, it is a 
very generic yet powerful technique. 

Let us apply the majorizer minimization approach to Boltzmann machine learning with 
Li regularization. Because the pseudo-likelihood function and cost function in the minimum 
probability flow are differentiable and convex,^® the majorizer minimization method can be 
applied. The majorizer for Boltzmann machine learning is given as 


G(/,h';7,h) = L{lh) + Yj 

iU) 


d£(J,h) 


dJi^ 


+ 


E 




dhi 


J,h 


J.h ^ (ij) 

(K - hi) + y ^ - hif , 


where Lj and L/, satisfy 

y ( d£(J, h) 
dJiJ 


(ij) 


d£{J,h) 


A,h 


dJij 


B,h/ 


/ d£(J, h) 

Zu[~dh~ 


d£{J,h) 


J,Sk 


dhi 


J,b 


<Lj^ (Aij - 

iij) 

< Lh ^ (ai - bi)^ . 


(24) 


(25) 


(26) 


Following the prescription of the majorizer minimization approach, let us iteratively solve the 
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optimization problem 




arg 


\'fij\ + ‘^h 'y \hi\ + G{J, h, y , h ) i . 

’ I (ij) i ) 


(27) 


Because the dependence of J and h on the majorizer is separate, we independently solve the 
optimization problem for each parameter as 


J‘ 


f+i 




h 


f+i 


^hlU [h'i + 


1 ^X(7,h) 

Lj dJij 

1 d£iJ,h) 


J'M'J 


dhi 


J'M' 


(28) 


(29) 


The majorizer minimization method is a generic technique for reaching a minimum point 
by recursive manipulation, under the assumption that the cost function is convex and its 
derivative is Lipschitz. These conditions are satisfied in the cost functions of the pseudo¬ 
likelihood function and minimum probability flow. The derivatives of the pseudo-likelihood 
function yield 

5XpL(/,h) 

= oZ.-'i V “■■■■' 

k=\ k=\ r=l Kjedi 


dJij 


dXpMh) 

dhi 


k=l k=l i=l \jedi 

= +*. 

k=\ k=]. i=i \jedi 


(30) 


(31) 


In these cases, it is difficult to compute the Lipschitz constant. We may use the backtracking 
technique, in which we gradually tune Lj and Lh by some rule such that 

i;PL(/^',h'+i) < (32) 

In addition, the case of the minimum probability flow is evaluated as 

d£MFFiJ,h) ^ ( 33 ) 


dJi 


k=\ HT) 
D 


= 7£j;(x®-x;'')exp{i(E(x®|J,h)-E(x'''|J,h))}. (34) 


- 1.1 m 

These gradients are reduced for one-spin flips, using Ei/Ii = N - 2 

D N ( \ 


d£M¥v{J-, h) 


dJij 


d£M¥vi.£ h) 


" D EZZ 1 E 


dhi 


^ E E ^''p E 

jedi 


k=\ i=\ jedi 


D N 


[^nedi 


.(k) 


(35) 


(36) 


k=\ i=l jedi y 

where we assume that fth spin is flipped from the kth spin configuration (this is the Zth con- 
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figuration in the summation in Eqs. (33) and (34)). Similarly, we may use the backtraeking 
teehnique such that Lj and L/, hold 

i;MPF(/^',h'^') < G(/^',h'^'|/,hO. (37) 


An acceleration technique is available for the majorizer minimization method.^® We modify 
the update rule into 




J\'r - vajilj 




Lj dJi, 


IJ 


+ 


J'.h'A 




, 1 d£iJ,h) 

riA,iL„ I / + 


J'M< 


+ 


A-1 
A+i 

A-i 

Pt+\ 


^Aj/Lj 


( 1 d£(J,h) 

\ 

\ 

jt 

^ Lj dJtj 

y',hv 

/ 


(38) 


(39) 


where 


A+i - 


1 + 

2 


(40) 


The initial condition is jSo = 1- In this update rule, the convergence speed is improved as 
2 2 

2(0) (4 - 7* ) and 2/ [h‘. - h*^ ~ 0(1/t^), where the asterisk denotes the optimal solution. 


4. Numerical test 

We conducted several numerical experiments to test the estimation of sparse interactions. 
The spin configurations were generated from the Markov chain Monte Carlo simulations. 
The linear size Ni = 5', that is, the entire spin N = Nl = 25. The number of interactions was 

= 625; the number of biases was N = 25. The true parameters for the biases were given 
by a Gaussian distribution with zero mean and unit variance. In contrast, the true parameters 
for the interactions were restricted to (i) the random sparse pairs (the non-zero interactions is 
restricted to 10% of all pairs, namely 62) and (ii) the nearest neighboring pairs on the square 
lattice (the number of non-zero interactions 100). We assumed that the interactions should 
be symmetric, namely = 7y,. The values for the interactions used random variables that 
follow a Gaussian distribution with zero mean and unit variance. 

The estimation had no prior knowledge of the structure of J and h. In other words, the 
estimator did not know the lattice, and did not know that the non-zero interaction was re¬ 
stricted to specific pairs. For each method, we estimated the parameters while changing D as 
D = 100,500, 1000, 2000, 3000, and 5000. The optimal selection of the coefficient A could 
not be known a priori. We then tested several values of A for the estimations of the parameters. 
In Fig. 1, we show the averaged performance over 100 samples after 200 iterations for the 
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Fig. 1. (Color online) Average performance of Li-regularized inference in the case of random sparse inter¬ 
actions. The horizontal axis denotes the amount of data. The vertical axis stands for the summation of the 
errors on estimations of J and h, Erry + Err;,, which are defined as Erry = V 2(0) jfj and 

Err* = iyhi - The data amounts were D = 100 (magenta), D - 500 (yellow), D - 1000 

(cyan), D — 2000 (red), D - 3000 (green), and D — 5000 (blue) from top to bottom. 



5 10 15 20 25 5 10 15 20 25 5 10 15 20 25 


Fig. 2. (Color online) Profile (absolute value) of the pair-wise interactions in the random sparse case (one 
example). The left panel shows the original configuration of the pair-wise interactions. The center panel shows 
the results of the pseudo-likelihood estimation (A - 0.2); the right panel shows the results of the minimum 
probability flow (T = 0.02). 


pseudo-likelihood estimation, and 50 iterations for the minimum probability flow, for a case 
in which the pair-wise interactions were distributed randomly. We note that the convergence 
speed of the minimum probability flow was significantly faster than the pseudo-likelihood 
estimation, although the precision of the convergent solutions was comparable. The numbers 
of iterations used in both methods were sufficient to obtain the convergent estimations. Both 
of the methods could estimate the correct values of the biases and interactions. In Fig. 2, we 
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Fig. 3. (Color online) Comparison of the pair-wise interactions and biases to the true parameters in random 
sparse interactions (one example). The vertical axis denotes the true parameters and the horizontal axis stands for 
the estimated values. The upper left panel shows the results for Jij in the pseudo-likelihood estimation (A - 0.2) 
and the upper right one shows that of the minimum probability flow (d = 0.02). The lower panels describe the 
results for /r, by the pseudo-likelihood estimation (left) and the minimum probability flow (right). The red lines 
have a unit slope as a guide to the eye. 


show the profile of the estimated interaetions for a single sample. We eonfirmed that the esti¬ 
mation of the non-zero interaetions had been aehieved, although their absolute values tended 
to be smaller than the original values. This is a eharaeteristie property of the Li regularization. 
We compared the pair-wise interactions and biases to the true parameters, as shown in Fig. 
3. We observe a fairly good performance for the nonzero components of the pair-wise inter¬ 
actions and biases. The zeros of the pair-wise interactions are obtained as extremely small 
valued estimations. We may set some thresholds to prune the irrelevant interactions in the es¬ 
timation. Figure 4 shows the performance averaged over 100 samples after 200 iterations of 
the pseudo-likelihood estimation, and 50 iterations of the minimum probability flow for a case 
in which pair-wise interactions were set on the square lattice. An increase in D improved the 
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Fig. 4. (Color online) Average performance of the Li-regularized inference for a case in which a square 
lattice was used (one example). The axes are the same as those in Fig. 2. In this case, we further investigated 
the dependence on the amount of given data D. The data amounts were D = 100 (magenta), D - 500 (yellow), 
D — 1000 (cyan), D - 2000 (red), D = 3000 (green), and D — 5000 (blue) from top to bottom. 


precision of the estimation in both methods. Both methods could lead to precise estimations 
of the pair-wise interactions and biases. The profile of the estimated interactions is shown 
in Fig. 5. A comparison of the estimated interactions and biases with the true parameters is 
shown in Fig. 6. We emphasize that the estimator did not have any prior knowledge of the 
structure of the interactions. In this sense, we have succeeded in deriving the relevant struc¬ 
ture of the pair-wise interactions from a type of microscopic degrees of freedom snapshot. 
This indicates that the microscopic behavior observation characterized the generative model 
through the estimation, by use of Li regularization. In addition, we truncated insignificant 
parameters with the aid of Li regularization. In both cases of the random sparse interactions 
and the square lattice, we succeeded in reproducing the structure of the pair-wise interactions 
and estimating the magnitude of the interactions. We emphasize that the gradient method with 
majorizer minimization method was replaced by the simple iterative substitution. The tech¬ 
nique we showed is expected to be applied to wide range of applications to seek the relevant 
interactions and biases generating the data. In these numerical experiments, we demonstrate 
the case when we intend to apply our technique to the actual data. Thus we prepare the spe¬ 
cific pair-wise interactions a priori and generate the numerous data. To further investigate the 
precision of our method, the hyperparameters Aj and Ah may be assumed to be distributed 
following the hyperprior distribution. As shown above, we would find the least square error 
in the optimal hyperparameters, which correspond to the distributed ones. 
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Fig. 5. Profile (absolute value) of the pair-wise interactions for a case in which a square lattice was used 
(one example). The left panel shows the original configuration of the pair-wise interactions. The center panel 
describes the estimation derived by the pseudo-likelihood estimation (A - 0.1) and the right panel shows the 
estimation derived by the minimum probability flow (A - 0.018). 


5. Summary 

In this study, we analyzed Boltzmann maehine learning in terms of pseudo-likelihood 
estimation and minimum probability flow. In order to elueidate the most relevant parame¬ 
ters generating the data, we sought a sparse solution in the present study. This task was very 
important for determining the strueture of the data while pruning irrelevant parameters. Li 
regularization was beneficial in obtaining a sparse solution by solving a given cost function. 
However, in general, the non-smoothness of the Li norm hampered the direct manipulation of 
the gradient method, which is intended to minimize the cost function. This study featured the 
implementation of the majorizer minimization method into the Boltzmann machine learn¬ 
ing technique. The majorizer minimization method is a type of mean-field analysis, which 
enabled us to express a many-body interacting system in terms of an effective one-body in¬ 
dependent system. 

We tested our method to elucidate the randomly distributed interactions, and those be¬ 
tween the adjacent spins on the square lattice, without any prior knowledge. The perfor¬ 
mance of our method is fairly satisfactory, as expected. Increasing the amount of given data 
improved the precision of the estimations and enhanced the efficacy of the Li regularization. 
In present study, the cost functions are given by the pseudo likelihood function as well as 
the minimum probability flow. The former one is generalized to the composite pseudo likeli¬ 
hood function inspired by the cluster variational method.In this kind of generalization, the 
majorizer minimization is applicable. In this sense, our scheme is very flexible. 

Notice that our numerical experiments were assumed to be an extremely generic case, 
that is with in homogenous pair-wise interactions and biases. One might intend to infer the 
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Fig. 6. Comparison of the pair-wise interactions and biases to the true parameters for a case in which a square 
lattice was used. The symbols and axes are the same as those in Fig. 3 


homogeneous property from the given data. The necessary number for precise estimations 
should then be extremely reduced. The recent study improves precision of the Boltzmann 
machine learning with the comparable number of the data by aid of the Belief propagation to 
estimate the average and variance from the empirical data.^^ We anticipate that future studies 
will apply our present method to actual observed data, to elucidate the essential property from 
nature. 
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